Welcome to this full write-up forecasting example using the nycflights13 data set. The solutions here follow the chapters in Forecasting: Principles and Practice, 3rd edition, by Rob J. Hyndman and George Athanasopoulos.
The chapters are:
1. Creating time series objects, plotting and autocorrelation
2. Time Series Decomposition
3. Time Series Features
4. The forecaster's basic toolbox
5. Time series regression models
6. Exponential smoothing models
7. ARIMA models
8. Dynamic regression models
9. Forecasting hierarchical and grouped time series
10. Advanced forecasting methods
We will begin by importing the required libraries into R.
library(nycflights13)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✓ lubridate 1.8.0 ✓ feasts 0.2.2
## ✓ tsibble 1.1.0 ✓ fable 0.3.1
## ✓ tsibbledata 0.3.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
library(clock)
##
## Attaching package: 'clock'
## The following object is masked from 'package:lubridate':
##
## as_date
Our first step will be converting the nycflights13 data set into a tsibble so that we can apply time series forecasting methods to it. This includes building a date column, which will make the analysis much easier and faster.
flights <- nycflights13::flights %>%
  mutate(date = date_build(year = year, month = month, day = day))
flight1 <- flights %>%
  select(date, dep_time:distance)
The data has some missing values, so it needs to be cleaned up before we can analyse it fully. Specifically, there are 8,255 rows with missing departure time, departure delay, arrival time, arrival delay, and air time. The most likely explanation is that these flights were scheduled but never flew. Let's remove these rows from our data set.
# 8,255 rows with missing departure times
flight1 %>% filter(is.na(dep_time))
## # A tibble: 8,255 × 14
## date dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <date> <int> <int> <dbl> <int> <int>
## 1 2013-01-01 NA 1630 NA NA 1815
## 2 2013-01-01 NA 1935 NA NA 2240
## 3 2013-01-01 NA 1500 NA NA 1825
## 4 2013-01-01 NA 600 NA NA 901
## 5 2013-01-02 NA 1540 NA NA 1747
## 6 2013-01-02 NA 1620 NA NA 1746
## 7 2013-01-02 NA 1355 NA NA 1459
## 8 2013-01-02 NA 1420 NA NA 1644
## 9 2013-01-02 NA 1321 NA NA 1536
## 10 2013-01-02 NA 1545 NA NA 1910
## # … with 8,245 more rows, and 8 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>
# remove rows with missing departure times
flight1 <- flight1 %>%
filter(!is.na(dep_time))
There are still 2,808 NAs in the data set. It appears all these flights did fly. Let’s find these flights and see what we can do to clean everything up:
sum(is.na(flight1)) # 2808
## [1] 2808
# Let's see where the NAs are located:
flight1 %>%
filter(!complete.cases(.))
## # A tibble: 1,175 × 14
## date dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <date> <int> <int> <dbl> <int> <int>
## 1 2013-01-01 1525 1530 -5 1934 1805
## 2 2013-01-01 1528 1459 29 2002 1647
## 3 2013-01-01 1740 1745 -5 2158 2020
## 4 2013-01-01 1807 1738 29 2251 2103
## 5 2013-01-01 1939 1840 59 29 2151
## 6 2013-01-01 1952 1930 22 2358 2207
## 7 2013-01-01 2016 1930 46 NA 2220
## 8 2013-01-02 905 822 43 1313 1045
## 9 2013-01-02 1125 925 120 1445 1146
## 10 2013-01-02 1848 1840 8 2333 2151
## # … with 1,165 more rows, and 8 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>
# Just three columns contain all 2,808 NAs, so we'll build special data sets
# when modelling those variables, and use the full data set when modelling
# variables with complete data (such as departure time):
sum(is.na(flight1$arr_time), is.na(flight1$arr_delay), is.na(flight1$air_time)) # 2,808
## [1] 2808
Just remember, when building models that include arrival time, arrival delay, and air time, to remove the rows with NAs first.
flight1 <- flight1 %>%
group_by(date) %>%
summarise(
flights = n(),
mean_dep_time = mean(dep_time),
mean_dep_delay = mean(dep_delay),
mean_arrival_time = mean(arr_time, na.rm = TRUE),
mean_arrival_delay = mean(arr_delay, na.rm = TRUE),
mean_air_time = mean(air_time, na.rm = TRUE),
mean_distance = mean(distance),
key = 100
) %>%
ungroup()
# Determine most common destination per day
mcd <- flights %>%
select(date, dest) %>%
group_by(date, dest) %>%
summarise(most_common_destination = n()) %>%
arrange(desc(most_common_destination)) %>%
filter(row_number() == 1) %>%
arrange(date)
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
flight1 <- flight1 %>%
mutate(most_common_destination = mcd$dest)
# Determine most common carrier per day
mcc <- flights %>%
select(date, carrier) %>%
group_by(date, carrier) %>%
summarise(most_common_carrier = n()) %>%
arrange(desc(most_common_carrier)) %>%
filter(row_number() == 1) %>%
arrange(date)
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
flight1 <- flight1 %>%
mutate(most_common_carrier = mcc$carrier)
# Determine most common origin per day
mco <- flights %>%
select(date, origin) %>%
group_by(date, origin) %>%
summarise(most_common_origin = n()) %>%
arrange(desc(most_common_origin)) %>%
filter(row_number() == 1) %>%
arrange(date)
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
flight1 <- flight1 %>%
mutate(most_common_origin = mco$origin)
flight1 <- flight1 %>%
as_tsibble(index = date, key = key) %>%
arrange(date) %>%
ungroup()
flight1 %>%
autoplot(flights)
flight1 %>%
autoplot(mean_dep_time)
flight1 %>%
autoplot(mean_dep_time) +
geom_line(aes(y = mean_dep_delay))
flight1 %>%
ggplot(aes(x = mean_dep_time, y = mean_dep_delay)) +
geom_point() +
labs(x = "Average departure time",
y = "Average departure delay")
flight1 %>%
GGally::ggpairs()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Discussion: Most correlations are below 0.2, but a few are much larger:
The correlation between mean_distance and date is 0.473, meaning the average flight distance tends to increase over the course of the year.
The largest correlation, 0.944, is between mean arrival delay and mean departure delay. Correlation is not a probability, but a value this close to 1 indicates a very strong linear relationship: on days when flights depart late on average, they also tend to arrive late.
flight1 %>%
gg_lag(flights, geom = "point") +
labs(x = "lag(flights, k)")
(from the text) Each graph shows \(y_t\) plotted against \(y_{t-k}\) for different values of \(k\)
(from the text) Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.
There are several autocorrelation coefficients, corresponding to each panel in the lag plot. For example, \(r_1\) measures the relationship between \(y_t\) and \(y_{t-1}\), \(r_2\) measures the relationship between \(y_t\) and \(y_{t-2}\), and so on.
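The coefficient \(r_k\) can be computed by hand. Here is a minimal base-R sketch on a simulated series (not the flight data); stats::acf() reports the same values:

```r
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.6), n = 200))  # simulated AR(1) series

# r_k: correlation between the series and itself lagged k periods
acf_manual <- function(y, k) {
  ybar <- mean(y)
  num <- sum((y[(k + 1):length(y)] - ybar) * (y[1:(length(y) - k)] - ybar))
  num / sum((y - ybar)^2)
}

r1 <- acf_manual(y, 1)
r1_stats <- acf(y, lag.max = 1, plot = FALSE)$acf[2]  # stats::acf agrees
```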
flight1 %>% ACF(flights, lag_max = 10)
## # A tsibble: 10 x 3 [1D]
## # Key: key [1]
## key lag acf
## <dbl> <lag> <dbl>
## 1 100 1D 0.222
## 2 100 2D -0.117
## 3 100 3D -0.0607
## 4 100 4D -0.0696
## 5 100 5D -0.120
## 6 100 6D 0.106
## 7 100 7D 0.652
## 8 100 8D 0.0905
## 9 100 9D -0.0937
## 10 100 10D -0.0482
We can view the values in the autocorrelation function:
flight1 %>%
ACF(flights) %>%
autoplot() +
labs(title = "Flights out of New York City in 2013, by day")
The strongest lags are at 7, 14, and 21 days, which is clearly due to the weekly pattern in the business. \(r_1\) is also fairly strong because the number of flights on any given day is related to the number of flights the previous day.
(from the text) The dashed blue lines indicate whether the correlations are significantly different from zero.
(from the text) When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.
When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.
If the values of \(r_1\) through \(r_k\) all lie within the dashed blue lines, the series is consistent with white noise.
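Those bounds sit at approximately \(\pm 1.96/\sqrt{T}\). A quick base-R check on a simulated white noise series:

```r
set.seed(123)
wn <- rnorm(365)                     # pure white noise, one "year" of days
bound <- 1.96 / sqrt(length(wn))     # the dashed blue lines
r <- acf(wn, lag.max = 20, plot = FALSE)$acf[-1]  # drop lag 0
prop_inside <- mean(abs(r) < bound)  # roughly 95% expected inside
```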
(from the text) If we assume an additive decomposition, then we can write:
\[y_t = S_t + T_t + R_t\]
where \(y_t\) is the data, \(S_t\) is the seasonal component, \(T_t\) is the trend-cycle component, and \(R_t\) is the remainder component, all at period \(t\). Alternatively, a multiplicative decomposition would be written as:
\[y_t = S_t \times T_t \times R_t\]
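To make the two forms concrete, here is a small simulated example in base R (illustrative components, not the flight data). An additive series is simply the sum of its components, and taking logarithms of a multiplicative series yields an additive decomposition of the logs:

```r
set.seed(2)
t  <- 1:28
Tt <- 100 + 0.5 * t                       # trend-cycle
St <- rep(c(5, -3, 2, -1, 4, -6, -1), 4)  # weekly seasonal pattern, sums to zero
Rt <- rnorm(28, sd = 1)                   # remainder

y_add  <- St + Tt + Rt                         # additive form
y_mult <- (1 + St / 100) * Tt * exp(Rt / 100)  # multiplicative form

# Taking logs turns the multiplicative form into an additive one:
# log(y_mult) = log(1 + St/100) + log(Tt) + Rt/100
```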
This will allow us to look at the decomposition of flights in the flight1 data set:
dcmp1 <- flight1 %>%
model(stl = STL(flights))
components(dcmp1)
## # A dable: 365 x 8 [1D]
## # Key: key, .model [1]
## # : flights = trend + season_week + remainder
## key .model date flights trend season_week remainder season_adjust
## <dbl> <chr> <date> <int> <dbl> <dbl> <dbl> <dbl>
## 1 100 stl 2013-01-01 838 852. 34.8 -48.7 803.
## 2 100 stl 2013-01-02 935 856. 40.4 38.7 895.
## 3 100 stl 2013-01-03 904 860. 66.8 -22.8 837.
## 4 100 stl 2013-01-04 909 863. 20.3 25.3 889.
## 5 100 stl 2013-01-05 717 867. -194. 44.3 911.
## 6 100 stl 2013-01-06 831 869. -31.3 -6.97 862.
## 7 100 stl 2013-01-07 930 872. 62.2 -3.91 868.
## 8 100 stl 2013-01-08 895 871. 36.3 -12.1 859.
## 9 100 stl 2013-01-09 897 870. 40.8 -13.7 856.
## 10 100 stl 2013-01-10 929 868. 67.6 -6.26 861.
## # … with 355 more rows
Note that we can do the same for any other variable in our data set, such as mean arrival time:
dcmp2 <- flight1 %>%
model(stl = STL(mean_arrival_time))
components(dcmp2)
## # A dable: 365 x 8 [1D]
## # Key: key, .model [1]
## # : mean_arrival_time = trend + season_week + remainder
## key .model date mean_arrival_time trend season_week remainder
## <dbl> <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 100 stl 2013-01-01 1562. 1546. 5.58 10.5
## 2 100 stl 2013-01-02 1533. 1543. -0.00116 -10.3
## 3 100 stl 2013-01-03 1536. 1540. -1.37 -2.01
## 4 100 stl 2013-01-04 1519. 1537. -45.2 27.6
## 5 100 stl 2013-01-05 1509. 1534. 11.7 -36.4
## 6 100 stl 2013-01-06 1573. 1531. 39.3 2.49
## 7 100 stl 2013-01-07 1516. 1529. -9.28 -3.14
## 8 100 stl 2013-01-08 1534. 1525. 4.76 4.17
## 9 100 stl 2013-01-09 1523. 1521. -0.0914 1.63
## 10 100 stl 2013-01-10 1523. 1519. -1.61 6.31
## # … with 355 more rows, and 1 more variable: season_adjust <dbl>
components(dcmp1) %>%
as_tsibble() %>%
autoplot(flights, colour = "gray") +
geom_line(aes(y = trend), colour = "#D55E00") +
labs(
y = "flights",
title = "Number and trend of flights per day"
)
components(dcmp1) %>% autoplot()
From the text: If the seasonal component is removed from the original data, the resulting values are the “seasonally adjusted” data. For an additive decomposition, the seasonally adjusted data are given by \(y_t - S_t\), and for multiplicative data the seasonally adjusted values are obtained by \(\frac{y_t}{S_t}\)
components(dcmp1) %>%
as_tsibble() %>%
autoplot(flights, colour = "gray") +
geom_line(aes(y = season_adjust), colour = "#0072B2") +
labs(
y = "Flights",
title = "Number and seasonally adjusted number of flights per day"
)
(from the text) The classical method of time series decomposition originated in the 1920s and was widely used until the 1950s. It still forms the basis of many time series decomposition methods, so it is important to understand how it works. The first step in a classical decomposition is to use a moving average method to estimate the trend-cycle, so we begin by discussing moving averages.
A moving average of order \(m\) can be written as:
\[\hat{T}_t = \frac{1}{m}\sum_{j=-k}^{k} y_{t+j}\]
where \(m = 2k+1\).
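To see what the formula does, here is a centred 7-MA computed with stats::filter() on the first ten daily flight counts shown earlier (the last two values are illustrative, added so the window has room to slide):

```r
y <- c(838, 935, 904, 909, 717, 831, 930, 895, 897, 929, 912, 740)
ma7 <- as.numeric(stats::filter(y, rep(1 / 7, 7), sides = 2))  # centred 7-MA

# The centred value at t = 4 is the plain mean of observations 1 through 7;
# the first and last k = 3 positions are NA because the window falls off the ends.
ma7[4]
mean(y[1:7])
```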
Let us calculate and view a 7-day moving average for the flight data. We use a 7 day moving average because our data has weekly seasonality:
flight1 %>%
mutate(
`7-MA` = slider::slide_dbl(flights, mean,
.before = 3, .after = 3, .complete = TRUE)
) %>%
autoplot(flights, colour = "gray") +
geom_line(aes(y = `7-MA`), colour = "#D55E00")
## Warning: Removed 6 row(s) containing missing values (geom_path).
The classical decomposition method originated in the 1920s. It is a relatively simple procedure, and forms the starting point for most other methods of time series decomposition. There are two forms of classical decomposition: an additive decomposition and a multiplicative decomposition. These are described below for a time series with seasonal period \(m\) (\(m = 4\) for quarterly data, \(m = 12\) for monthly data, \(m = 7\) for daily data with a weekly pattern).
1. If \(m\) is an even number, compute the trend-cycle component, \(\hat{T}_t\), using a \(2\times m\)-MA. If \(m\) is an odd number, compute the trend-cycle component, \(\hat{T}_t\), using an \(m\)-MA.
2. Calculate the detrended series: \(y_t - \hat{T}_t\)
3. To estimate the seasonal component for each season, simply average the detrended values for that season. For example, with monthly data, the seasonal component for March is the average of all the detrended March values in the data. These seasonal component values are then adjusted to ensure that they add to zero. The seasonal component is obtained by stringing together these monthly values, and then replicating the sequence for each year of data. This gives \(\hat{S}_t\).
4. The remainder component is calculated by subtracting the estimated seasonal and trend-cycle components: \(\hat{R}_t = y_t - \hat{T}_t - \hat{S}_t\)
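The steps above can be sketched directly in base R. This is an illustrative simulation with weekly seasonality (\(m = 7\), odd, so a plain 7-MA estimates the trend-cycle), not the flight data:

```r
set.seed(42)
m <- 7
n <- 10 * m
season <- rep(c(30, 40, 60, 20, -190, -30, 70), 10)   # rough weekly shape
y <- 850 + 0.3 * seq_len(n) + season + rnorm(n, sd = 15)

# Step 1: trend-cycle via a centred 7-MA (m is odd)
trend <- as.numeric(stats::filter(y, rep(1 / m, m), sides = 2))

# Step 2: detrend
detrended <- y - trend

# Step 3: average the detrended values for each day of the week,
# then adjust so the seasonal indices sum to zero
idx <- rep(1:m, length.out = n)
s_raw <- tapply(detrended, idx, mean, na.rm = TRUE)
s_adj <- s_raw - mean(s_raw)
seasonal <- as.numeric(s_adj[idx])

# Step 4: remainder
remainder <- y - trend - seasonal
```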
flight1 %>%
model(
classical_decomposition(flights, type = "additive")
) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition")
## Warning: Removed 3 row(s) containing missing values (geom_path).
flight1 %>%
model(
classical_decomposition(flights ~ season(12), type = "mult")
) %>%
components() %>%
autoplot() +
labs(title = "Classical multiplicative decomposition")
## Warning: Removed 6 row(s) containing missing values (geom_path).
(from the text) STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess,” while loess is a method for estimating nonlinear relationships. The STL method was developed by R. B. Cleveland et al. (1990).
STL has several advantages over classical decomposition, and the SEATS and X-11 methods:
• Unlike SEATS and X-11, STL will handle any type of seasonality, not only monthly and quarterly data.
• The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
• The smoothness of the trend-cycle can also be controlled by the user.
• It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.
flight1 %>%
  model(
    STL(flights ~ trend(window = 7) +
          season(window = "periodic"),
        robust = TRUE)) %>%
  components() %>%
  autoplot()
The feasts package includes functions for computing FEatures And Statistics from Time Series (hence the name). We have already seen some time series features. For example, the autocorrelations discussed in Section 2.8 can be considered features of a time series — they are numerical summaries computed from the series.
This gives the standard deviation of the daily number of flights:
flight1 %>% features(flights, sd)
## New names:
## * `` -> ...1
## # A tibble: 1 × 2
## key ...1
## <dbl> <dbl>
## 1 100 98.1
It’s also possible to calculate the minimum, maximum, quantiles, and other summary statistics of the data in the same way.
(from the text) Autocorrelations were discussed in Section 2.8. All the autocorrelations of a series can be considered features of that series. We can also summarise the autocorrelations to produce new features; for example, the sum of the first ten squared autocorrelation coefficients is a useful summary of how much autocorrelation there is in a series, regardless of lag.
We can also compute autocorrelations of the changes in the series between periods. That is, we “difference” the data and create a new time series consisting of the differences between consecutive observations. Then we can compute the autocorrelations of this new differenced series. Occasionally it is useful to apply the same differencing operation again, so we compute the differences of the differences. The autocorrelations of this double differenced series may provide useful information.
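Here is a base-R sketch of differencing and the ten-lag squared-autocorrelation summary, on a simulated trended series (not the flight data). Differencing once removes the trend, and differencing twice gives the differences of the differences:

```r
set.seed(7)
y <- cumsum(rnorm(120)) + 0.2 * (1:120)   # trended random walk

# sum of the first ten squared autocorrelation coefficients
acf10 <- function(x) {
  r <- acf(x, lag.max = 10, plot = FALSE)$acf[-1]  # drop lag 0
  sum(r^2)
}

d1 <- diff(y)                     # first differences
d2 <- diff(y, differences = 2)    # double differenced

# the trended series has much more autocorrelation than its differences
acf10(y)
acf10(d1)
acf10(d2)
```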
The feat_acf() function computes a selection of the autocorrelations discussed here. It will return six or seven features:
• the first autocorrelation coefficient from the original data;
• the sum of squares of the first ten autocorrelation coefficients from the original data;
• the first autocorrelation coefficient from the differenced data;
• the sum of squares of the first ten autocorrelation coefficients from the differenced data;
• the first autocorrelation coefficient from the twice differenced data;
• the sum of squares of the first ten autocorrelation coefficients from the twice differenced data;
• for seasonal data, the autocorrelation coefficient at the first seasonal lag.
flight1 %>% features(flights, feat_acf)
## # A tibble: 1 × 8
## key acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 season_acf1
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 0.222 0.541 -0.277 0.813 -0.504 1.15 0.652
(from the text) The STL decomposition discussed in Chapter 3 is the basis for several more features.
A time series decomposition can be used to measure the strength of trend and seasonality in a time series. Recall that the decomposition is written as
\[y_t = T_t + S_t + R_t\]
where \(T_t\) is the smoothed trend component, \(S_t\) is the seasonal component, and \(R_t\) is a remainder component. For strongly trended data, the seasonally adjusted data should have much more variation than the remainder component. Therefore Var(\(R_t\))/Var(\(T_t + R_t\)) should be relatively small. But for data with little or no trend, the two variances should be approximately the same. So we define the strength of trend as:
\[F_T = \max\left(0,\ 1 - \frac{\text{Var}(R_t)}{\text{Var}(T_t + R_t)}\right)\]
This gives a measure of the strength of the trend between 0 and 1. Because the variance of the remainder might occasionally be even larger than the variance of the seasonally adjusted data, we set the minimal possible value of \(F_T\) equal to zero.
The strength of seasonality is defined similarly, but with respect to the detrended data rather than the seasonally adjusted data:
\[F_S = \max\left(0,\ 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_t + R_t)}\right)\]
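The two strength measures can be computed by hand from any decomposition. Here is a base-R sketch using stats::decompose() on a simulated daily series with a weekly pattern (illustrative, not the flight data):

```r
set.seed(9)
n <- 20 * 7
y <- ts(500 + 2 * (1:n) + rep(c(30, 40, 60, 20, -190, -30, 70), 20) +
          rnorm(n, sd = 20), frequency = 7)

d <- decompose(y)               # classical additive decomposition
Tt <- d$trend; St <- d$seasonal; Rt <- d$random
keep <- !is.na(Rt)              # drop the ends lost to the moving average

# strength of trend and of seasonality, both clipped at zero
F_T <- max(0, 1 - var(Rt[keep]) / var(Tt[keep] + Rt[keep]))
F_S <- max(0, 1 - var(Rt[keep]) / var(St[keep] + Rt[keep]))
```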
flight1 %>% features(flights, feat_stl)
## # A tibble: 1 × 10
## key trend_strength seasonal_strength_week seasonal_peak_we… seasonal_trough…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 100 0.450 0.743 0 5
## # … with 5 more variables: spikiness <dbl>, linearity <dbl>, curvature <dbl>,
## # stl_e_acf1 <dbl>, stl_e_acf10 <dbl>
We can compute all of the features available in the feasts package in one line of code:
flight1 %>%
features(flights, feature_set(pkgs = "feasts")) %>%
print(width = 2000)
## Warning: `n_flat_spots()` was deprecated in feasts 0.1.5.
## Please use `longest_flat_spot()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: 1 error encountered for feature 20
## [1] The `fracdiff` package must be installed to use this functionality. It can be installed with install.packages("fracdiff")
## # A tibble: 1 × 48
## key trend_strength seasonal_strength_week seasonal_peak_week
## <dbl> <dbl> <dbl> <dbl>
## 1 100 0.450 0.743 0
## seasonal_trough_week spikiness linearity curvature stl_e_acf1 stl_e_acf10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 369. 197. -410. 0.139 0.270
## acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 season_acf1 pacf5
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.222 0.541 -0.277 0.813 -0.504 1.15 0.652 0.0969
## diff1_pacf5 diff2_pacf5 season_pacf zero_run_mean nonzero_squared_cv
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.905 1.22 0.622 0 0.0119
## zero_start_prop zero_end_prop lambda_guerrero kpss_stat kpss_pvalue pp_stat
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 2.00 0.747 0.01 -14.8
## pp_pvalue ndiffs nsdiffs bp_stat bp_pvalue lb_stat lb_pvalue var_tiled_var
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.01 1 1 18.0 0.0000225 18.1 0.0000208 0.809
## var_tiled_mean shift_level_max shift_level_index shift_var_max shift_var_index
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.192 164. 45 60503. 42
## shift_kl_max shift_kl_index spectral_entropy n_crossing_points
## <dbl> <dbl> <dbl> <int>
## 1 26.9 40 0.804 132
## longest_flat_spot stat_arch_lm
## <int> <dbl>
## 1 5 0.224
Tidy workflow
The steps in a tidy forecasting workflow are: prepare the data, plot the data, define a model, train the model, check model performance, and produce forecasts.
# Set the training data from January 1 through September 30
train <- flight1 %>%
  filter(date < "2013-10-01")
new_data <- flight1 %>%
  filter(date >= "2013-10-01")
# Fit the models
flights_fit <- train %>%
model(
Mean = MEAN(flights),
`Naïve` = NAIVE(flights),
`Seasonal Naïve` = SNAIVE(flights)
)
# Generate forecasts
flight_forecast_1 <-
flights_fit %>%
forecast(new_data = new_data) %>%
as_tsibble(key = key, index = date)
# Plot forecasts against actual values
flight_forecast_1 %>%
autoplot(.mean) +
autolayer(new_data, flights, colour = "black") +
labs(y = "Flights",
title = "Number of flights predicted by mean, naïve, and seasonal naïve") +
guides(colour = guide_legend(title = "forecast"))
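As a sanity check on what these benchmark methods do, here is a base-R sketch (illustrative numbers echoing the daily flight counts, not the fitted fable models): the mean method forecasts the training average, the naïve method repeats the last observation, and the seasonal naïve method repeats the most recent week:

```r
train_y <- c(838, 935, 904, 909, 717, 831, 930,
             895, 897, 929, 912, 891, 731, 844)  # two illustrative weeks
h <- 7                                           # forecast one week ahead

fc_mean   <- rep(mean(train_y), h)   # mean method: training average
fc_naive  <- rep(tail(train_y, 1), h)  # naive: repeat the last value
fc_snaive <- tail(train_y, 7)          # seasonal naive: repeat the last week
```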
A comment from the text on why classical decomposition is not recommended:
While classical decomposition is still widely used, it is not recommended, as there are now several much better methods.